Something Borrowed: Sequence Alignment and the Identification of Similar Passages in Large Text Collections

Authors

  • Russell Horton
  • Mark Olsen
  • Glenn Roe
Abstract

This article describes a simple technique for identifying lexically similar passages in large collections of text using sequence alignment algorithms. Primarily used in bioinformatics to identify similar segments of DNA in genome research, sequence alignment has also been employed in many other domains, from plagiarism detection to image processing. While we have applied this approach to a wide variety of text collections, we focus our discussion here on the identification of similar passages in the famous 18th-century Encyclopédie of Denis Diderot and Jean d'Alembert. Reference works, such as encyclopedias and dictionaries, are generally expected to "reuse" or "borrow" passages from many sources, and Diderot and d'Alembert's Encyclopédie was no exception. The borrowings that occur in the Encyclopédie are drawn from an immense variety of source material, both French and non-French, and many, if not most, of them are not sufficiently identified (by the standards of modern citation) or are only partially acknowledged in passing. The systematic identification of recycled passages can thus offer a clear indication of the sources the philosophes were exploiting, and it allows us to explore the intertextual relations that accompanied the Encyclopédie's composition and subsequent reception. In the end, we hope this approach to "Encyclopedic intertextuality" using sequence alignment can broaden the discussion concerning the relationship of Enlightenment thought to previous intellectual traditions as well as its reuse in the centuries that followed.
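As a rough illustration of the general idea, and not the authors' actual implementation, the sketch below applies Smith-Waterman local alignment to word tokens rather than DNA bases. The abstract does not name a specific algorithm, so the choice of Smith-Waterman, the function names `tokenize` and `local_align`, and the scoring values (match, mismatch, gap) are all assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment over two token lists.

    Returns (score, (a_start, a_end), (b_start, b_end)), where the
    half-open token ranges mark the best-scoring pair of passages.
    The scoring values are illustrative assumptions, not tuned settings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; scores never drop below zero, which is
    # what makes the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                0,
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # skip a token of a
                score[i][j - 1] + gap,       # skip a token of b
            )
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, best_pos[0]), (j, best_pos[1])


# Two made-up sentences sharing a five-word passage.
source = tokenize("the quick brown fox jumps over the lazy dog")
target = tokenize("yesterday a quick brown fox jumps over a fence")
score, (s0, s1), (t0, t1) = local_align(source, target)
print(score, " ".join(source[s0:s1]))  # 10 quick brown fox jumps over
```

The matrix costs time and memory proportional to the product of the two token counts, so a comparison across a large collection would presumably need a cheaper filtering step first (for example, shared word n-grams) before running the alignment on candidate pairs; that is a general observation and not something the abstract describes.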

Similar resources

“Reuse” of Biblical Quotes in Swedish 19th Century Fiction

Dimitrios Kokkinakis and Mats Malm. Introduction: Multifaceted relations between texts can be complex, abstract, diverse or subtle. Digital humanists are interested in identifying pairs of text passages likely to contain substantial overlap and empirically supporting (hopefully) new interpretations of historical texts. For instance, Cordell [3] discusses how digital interpretive tools can help ma...

An Application of the ABS LX Algorithm to Multiple Sequence Alignment

We present an application of ABS algorithms for multiple sequence alignment (MSA). The Markov decision process (MDP) based model leads to a linear programming problem (LPP), whose solution is linked to a suggested alignment. The important features of our work include the facility of alignment of multiple sequences simultaneously and no limit for the length of the sequences. Our goal here is to ...

The Company They Keep: Extracting Japanese Neologisms Using Language Patterns

We describe an investigation into the identification and extraction of unrecorded potential lexical items in Japanese text by detecting text passages containing selected language patterns typically associated with such items. We identified a set of suitable patterns, then tested them with two large collections of text drawn from the WWW and Twitter. Samples of the extracted items were evaluated...

gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences

Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...

Detecting Short Passages of Similar Text in Large Document Collections

This paper presents a statistical method for fingerprinting text. In a large collection of independently written documents each text is associated with a fingerprint which should be different from all the others. If fingerprints are too close then it is suspected that passages of copied or similar text occur in two documents. Our method exploits the characteristic distribution of word trigrams,...

Publication date: 2014